Batch attestation duty scheduling for an epoch#9374
Batch attestation duty scheduling for an epoch#9374StefanBratanov merged 11 commits intoConsensys:masterfrom
Conversation
| private static final int BATCH_SIZE = 2500; | ||
| private static final Duration BATCH_DELAY = Duration.ofSeconds(3); |
There was a problem hiding this comment.
i was thinking more slots...
2500 is a lot (but may be ok if its configurable)
Theres some use cases
- absolute worst case is bootstrapping and a large validator set. (would include a range sync here)
- best case is we're already running, because we're just padding out our current set of duties
im not sure if we need to handle these separately, but basically, i think something slot based should solve both scenarios
- any slots in the 'past' (by time computation) can be excluded
- current slot needs to be known ASAP (if we're late into that we may as well not sub)
- assuming current slot is not in our list of duties (this would be if we already have an epoch of duties computed and we're filling in epoch+1), then we can just do 1 slot of signatures and then sleep 1, then do another slot.
- if we're computing current epoch duties, then we probably need to be doing 3-4 slots in a batch...
so at start, we have 2 epochs of duties to figure out if we're aggregators
- exclude all 'historic' slots
- order by slot
- if we're current epoch, iterate 1 slot at a time for signatures. this gives gaps while we wait for that slot to complete if we're trying to get current duties in. sleep 500ms every period (3-4 slots)
- so if we have slots 10-32 of our current to consider, then we iterate those in batches, but 1 slot at a time (sleeping 1s every ... say 3rd-4th... slot)
- from 33-64, we would just do a single slot then sleep 1s. this would spread the sigs over a few slots.
If we have less than about 1k validators we likely don't care for this overhead, but once we get to large numbers like some of our test infra, this would help a lot.
On 32k validators, that looks like on average 1k / slot, so by spacing them out we get space to do things like sign actual duties.
Concretely,
eg1.
slot 32, i have already computed my current epoch (from the previous time) so i get only my future epoch. I have 32k sigs to do so i do 1k (average) sleep 1s, repeat and it takes me 32 seconds to get slots 33-64 aggregation duties.
eg2.
Slot 1, if i haven't computed any duties before, then i have 64 duties (current and next epoch)
i do slot 1 computes straight away, in the hope i'm not too late. then without a rest i do 2, then 3. that gives me all the data i need for the first few slots.
I sleep 500ms, do slots 4,5, then 6. rinse, repeat until i get to 33. (would take... 6 seconds ish of sleeps)
at 33, i do that, sleep 1s. 34, sleep 1s. repeat. (would take over half a minute not including signing time)
eg3.
slot 29, i haven't computed duties before, i get duties from current/next epoch.
I exclude 1-28, as they're historic and i've already missed those aggregating tasks.
I compute for 29. compute 30, 31, sleep 500ms.
then compute 32 which is last of current epoch, so sleep 500ms.
then 33, sleep 1s etc.
I guess theres still the edge case where you have more than 1k sigs per slot but i think at that point we have bigger issues...
This was a big comment, but kind of encapsulates some considerations i guess.
dd65c65 to
bc5d55a
Compare
| final boolean isCurrentEpoch = | ||
| spec.computeEpochAtSlot(currentSlot.get()).isLessThanOrEqualTo(epoch); |
There was a problem hiding this comment.
ahm... so current epoch should be just equals?
| // expensive aggregation slot signing in beginning of the epoch when a node is running a large | ||
| // number of validators | ||
| @VisibleForTesting | ||
| record SlotBatchingOptions( |
rolfyone
left a comment
There was a problem hiding this comment.
LGTM, we should see how it goes. I wonder if we also want to take timings of each batch between sleeps but that could definitely be separate to this PR.
|
@lucassaldanha I am not 100% sure how |
7698253 to
f647aeb
Compare
|
After: |
08e08a7 to
f9359c3
Compare
* Add read functionality to blobs archiving (#9318) * Schedule Gnosis Pectra hard-fork (#9340) * Updated ref tests v1.5.0-beta.5 (#9341) * Created Fulu skeleton (#9325) * Made checkpoint-state-url and initial-state mutually exclusive (#9342) It's hard to reason on how these two should interact, and logically they're mutually exclusive, so this just solidifies the theory that they shouldn't be used together. Signed-off-by: Paul Harris <paul.harris@consensys.net> * partial 3rd party updates and errorprone updates (#9351) I ended up disabling the instanceof errorprone, as it's very very common for us to be doing it apparently. I started changing things but in the interest of time will raise a ticket Signed-off-by: Paul Harris <paul.harris@consensys.net> * Make builder timeouts only for the HTTP call (#9353) Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net> * more 3rd party updates (#9355) Signed-off-by: Paul Harris <paul.harris@consensys.net> * KZG updates from das branch (#9335) * Avoid combining validators by aggregating committee when using DVT (#9357) * builder json format, make media type more compatible (#9360) * refactored expected withdrawals and added test cases (#9361) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Use `URI.create(...).toURL()` removing URL constructor deprecation (#9364) * clear-changelog (#9366) * New infra stream module (#9362) * feat: Add support for reading static peers from file (#9328) Co-authored-by: Paul Harris <paul.harris@consensys.net> * Start cleaning up `api/schema` (#9376) * Start cleaning up `api/schema` - moved `ValidatorStatus` - Removed `Version` - Moved `PublicKeyException` - Moved `EventType` Signed-off-by: Paul Harris <paul.harris@consensys.net> * cleanup the rest of api/schema (#9377) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Added KZG computeCells method (#9375) * Added KZG computeCells method * Moved KZG no-operation logic into testFixture class * Fulu Schemas (ssz + datastructures) - including fulu ssz ref tests (#9363) * Added context to the sync committee duties error (#9380) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Returning custody_group_count on node/identity Beacon API (#9381) * may 3rd party updates (#9388) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Added snakeyaml to allow more complicated config reading (#9390) - added tests to show reading lists of objects (BPO related) - added a testcase to show that hex is read correctly - added testcase to show specConfigReader isnt filling defaults when we read local config files Partially addresses #9365 Signed-off-by: Paul Harris <paul.harris@consensys.net> * Add FULU to the Spec Factory (#9373) * Improve attestation bits aggregator electra (#9393) * added a DeserializableConfigTypeDefinition (#9396) - Tests demonstrate that this should at least be very close to what we require for reading BPO schedule. Partially addresses #9365 Signed-off-by: Paul Harris <paul.harris@consensys.net> * Updating ref test to 1.5.0 stable (#9398) * uniforms the AggregatingAttestationPool interface (#9401) * Avoids potential mutability of the result of getCommitteeBits (#9402) * Adding DataColumnsByRootIdentifier container (#9399) * Added an info message for the highest milestone (#9405) Also added a sanity check that will fail if the specFactory hasn't defined the highest milestone correctly, which has flow on effects. fixes #9400 Signed-off-by: Paul Harris <paul.harris@consensys.net> * Introduce `PooledAttestation` and `PooledAttestationWithData` (#9407) * Improve jdk 24 support and add docker-jdk24 (#9410) * add docker-jdk24 support * solves jdk 24 startup warning * rename AttestationBitsAggregator to AttestationBits (#9412) * Introduce aggregating pool interface (#9414) * [docker-jdk24] SIGHUP handler fix (#9413) * Optionally include validator indices in `PooledAttestation` (#9418) * Improve committeesSize retrieval in `AggregatingAttestationPool` (#9419) * Batch attestation duty scheduling for an epoch (#9374) * Updated MiscHelpersFulu + Added Gossip Logger (#9422) * Updated MiscHelpersFulu * Added GossipLogger and DAS related changes * gossip DAS related changes * fix test * fix runtime with noop fulu managers * bump MAX_SUBSCRIBED_TOPICS for Fulu and further forks * Use master's circle ci config * Remove duplicated workflows * Update test fixtures with random order changes * fix test * fix test * fix GetNewBlockV3Test * fix GetBlockTest * Gossip DAS related changes (#9423) * added bpo parsing to configuration (#9406) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Fix ProduceBlockRequestTest for Fulu and updated order in random * Regenerate test fixtures with the changes in DataStructureUtil call order * Regenerate test fixtures with the changes in DataStructureUtil call order for GetNewBlockV3IntegrationTest * spotless * move GetDataColumnSidecars to tekuv1 space, fix openapi integration test * BPO - remove max blobs from fulu spec (#9427) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Introduces `RewardBasedAttestationSorter` (#9428) * Fix integration tests after master upstream * Fix property tests + refactor some random datastructure utils to avoid fulu duplication * fix assemble * fix test * revert das test coverage changes * DAS storage changes (#9432) * Added config defaulting to the Config loader (#9439) Signed-off-by: Paul Harris <paul.harris@consensys.net> Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * Introduce `AttestationMatchingGroupV2` (#9438) * Fix DasSyncAcceptanceTest, fix DSL, remove broken DSL --------- Signed-off-by: Paul Harris <paul.harris@consensys.net> Co-authored-by: Stefan Bratanov <stefan.bratanov93@gmail.com> Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net> Co-authored-by: Paul Harris <paul.harris@consensys.net> Co-authored-by: Enrico Del Fante <enrico.delfante@consensys.net> Co-authored-by: crStiv <cryptostiv7@gmail.com> Co-authored-by: Mehdi AOUADI <mehdi.aouadi@consensys.net> Co-authored-by: Lucas Saldanha <lucascrsaldanha@gmail.com> Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com>
* Add read functionality to blobs archiving (#9318) * Schedule Gnosis Pectra hard-fork (#9340) * Updated ref tests v1.5.0-beta.5 (#9341) * Created Fulu skeleton (#9325) * Made checkpoint-state-url and initial-state mutually exclusive (#9342) It's hard to reason on how these two should interact, and logically they're mutually exclusive, so this just solidifies the theory that they shouldn't be used together. Signed-off-by: Paul Harris <paul.harris@consensys.net> * partial 3rd party updates and errorprone updates (#9351) I ended up disabling the instanceof errorprone, as it's very very common for us to be doing it apparently. I started changing things but in the interest of time will raise a ticket Signed-off-by: Paul Harris <paul.harris@consensys.net> * Make builder timeouts only for the HTTP call (#9353) Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net> * more 3rd party updates (#9355) Signed-off-by: Paul Harris <paul.harris@consensys.net> * KZG updates from das branch (#9335) * Avoid combining validators by aggregating committee when using DVT (#9357) * builder json format, make media type more compatible (#9360) * refactored expected withdrawals and added test cases (#9361) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Use `URI.create(...).toURL()` removing URL constructor deprecation (#9364) * clear-changelog (#9366) * New infra stream module (#9362) * feat: Add support for reading static peers from file (#9328) Co-authored-by: Paul Harris <paul.harris@consensys.net> * Start cleaning up `api/schema` (#9376) * Start cleaning up `api/schema` - moved `ValidatorStatus` - Removed `Version` - Moved `PublicKeyException` - Moved `EventType` Signed-off-by: Paul Harris <paul.harris@consensys.net> * cleanup the rest of api/schema (#9377) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Added KZG computeCells method (#9375) * Added KZG computeCells method * Moved KZG no-operation logic into testFixture class * Fulu Schemas (ssz + datastructures) - including fulu ssz ref tests (#9363) * Added context to the sync committee duties error (#9380) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Returning custody_group_count on node/identity Beacon API (#9381) * may 3rd party updates (#9388) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Added snakeyaml to allow more complicated config reading (#9390) - added tests to show reading lists of objects (BPO related) - added a testcase to show that hex is read correctly - added testcase to show specConfigReader isnt filling defaults when we read local config files Partially addresses #9365 Signed-off-by: Paul Harris <paul.harris@consensys.net> * Add FULU to the Spec Factory (#9373) * Improve attestation bits aggregator electra (#9393) * added a DeserializableConfigTypeDefinition (#9396) - Tests demonstrate that this should at least be very close to what we require for reading BPO schedule. Partially addresses #9365 Signed-off-by: Paul Harris <paul.harris@consensys.net> * Updating ref test to 1.5.0 stable (#9398) * uniforms the AggregatingAttestationPool interface (#9401) * Avoids potential mutability of the result of getCommitteeBits (#9402) * Adding DataColumnsByRootIdentifier container (#9399) * Added an info message for the highest milestone (#9405) Also added a sanity check that will fail if the specFactory hasn't defined the highest milestone correctly, which has flow on effects. fixes #9400 Signed-off-by: Paul Harris <paul.harris@consensys.net> * Introduce `PooledAttestation` and `PooledAttestationWithData` (#9407) * Improve jdk 24 support and add docker-jdk24 (#9410) * add docker-jdk24 support * solves jdk 24 startup warning * rename AttestationBitsAggregator to AttestationBits (#9412) * Introduce aggregating pool interface (#9414) * [docker-jdk24] SIGHUP handler fix (#9413) * Optionally include validator indices in `PooledAttestation` (#9418) * Improve committeesSize retrieval in `AggregatingAttestationPool` (#9419) * Batch attestation duty scheduling for an epoch (#9374) * Updated MiscHelpersFulu + Added Gossip Logger (#9422) * Updated MiscHelpersFulu * Added GossipLogger and DAS related changes * Gossip DAS related changes (#9423) * added bpo parsing to configuration (#9406) Signed-off-by: Paul Harris <paul.harris@consensys.net> * BPO - remove max blobs from fulu spec (#9427) Signed-off-by: Paul Harris <paul.harris@consensys.net> * Introduces `RewardBasedAttestationSorter` (#9428) * DAS storage changes (#9432) * Added config defaulting to the Config loader (#9439) Signed-off-by: Paul Harris <paul.harris@consensys.net> Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * Introduce `AttestationMatchingGroupV2` (#9438) * Introduce `AggregatingAttestationPoolV2` (#9445) * Introduce `AggregatingAttestationPoolProfiler` implementations (#9447) Co-authored-by: Paul Harris <paul.harris@consensys.net> * PeerDAS RPC related changes (#9444) * Added a development flag to enable strict config loading (#9449) Strict loading was the default until yesterday, but it's caused a number of problems because if we push changes too fast etc it causes support issues. The new default permissive loading would mean that if a passed config file is missing an entry, we default it and print a warning. This new flag `--Xstartup-strict-config-loader-enabled` defaults to false, but if turned on will revert to failing on startup if a config being loaded doesn't include all of the expected configuration attributes. We could potentially start using this in AT to specify minimal changes to a configuration for a test, but for now I've left acceptance tests using strict loading. Signed-off-by: Paul Harris <paul.harris@consensys.net> * Updating reference tests to 1.6.0-alpha.0 (#9450) * Updating reference tests to 1.6.0-alpha.0 * Added Fulu networking tests * rename * Refactoring networking ATs * Add peer via api (#9431) * add initial impl of endpoint Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * add catch to cover address conversion failure Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * add tests for 200, 400 and 500 Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * spotless Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * add test for when discovery is not enabled Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * spotless Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * fix name of test after changing exception returned for invalid peer Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * add json def to paths Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * add changelog Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * PR review Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * spotless caught me again Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * change impl and tests to handle a single peer per api call Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * spotless Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> --------- Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> * `AggregatingAttestationPool` new CLI params (#9452) * poolV2-CLI * fix arity * getBits method names clearer (#9453) * Validate electra attestationData index received by BN (#9430) * fix integration test --------- Signed-off-by: Paul Harris <paul.harris@consensys.net> Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com> Co-authored-by: Stefan Bratanov <stefan.bratanov93@gmail.com> Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net> Co-authored-by: Paul Harris <paul.harris@consensys.net> Co-authored-by: Enrico Del Fante <enrico.delfante@consensys.net> Co-authored-by: crStiv <cryptostiv7@gmail.com> Co-authored-by: Mehdi AOUADI <mehdi.aouadi@consensys.net> Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com>
PR Description
Introduce
AttestationDutyDefaultSchedulingStrategyandAttestationDutyBatchSchedulingStrategyBatching by slots with added delays is used when there are lot of duties to be scheduled (excessive signing at the beginning of an epoch)
Fixed Issue(s)
fixes #9292
Documentation
doc-change-requiredlabel to this PR if updates are required.Changelog